-
Notifications
You must be signed in to change notification settings - Fork 1
permutation tests #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- saved processed dataset to example_data - removed data cleaning code from notebook - added descriptions and permutation explanation in markdown
enryH
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will test it now.
| @@ -0,0 +1,53 @@ | |||
| # preprocessing | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename this to something similar to compositional.py - I would like to not have it specific to microbiome at best, but we can group it here for now. Maybe you could move the file as compositional.py to transform/compositional.py?
| @@ -0,0 +1,25 @@ | |||
| eff_id,inf_id,sampling_location,sampling_read,eff_abundance,inf_abundance | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a README.md to the folder briefly putting a link where this originates from? And maybe if it is a subset, a small preprocessing script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive permutation testing functionality to the acore library, including implementations for paired and independent sample tests, chi-squared tests, and Jaccard similarity calculations. The changes also include documentation, examples, and minor code quality improvements.
- Implements three main permutation test functions:
paired_permutation,indep_permutation, andchi2_permutation - Adds Jaccard similarity calculations for graph comparison
- Includes comprehensive test suite and tutorial documentation with real-world metagenomics example
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| src/acore/permutation_test/init.py | Core permutation test implementations with support for multiple metrics |
| src/acore/permutation_test/internal_functions.py | Helper functions for permutation and contingency table operations |
| src/acore/permutation_test/jaccard.py | Jaccard similarity calculation functions for set comparison |
| src/acore/types/permutation_test.py | Pydantic model for permutation test results |
| tests/test_permutation_test.py | Comprehensive test suite for permutation test functions |
| docs/api_examples/permutation_testing.py | Tutorial demonstrating permutation testing on metagenomics data |
| docs/api_examples/permutation_testing.ipynb | Jupyter notebook version of the tutorial |
| docs/index.rst | Updated documentation index to include permutation testing tutorial |
| src/acore/microbiome/internal_functions.py | New CLR transformation functions for compositional data |
| example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv | Example dataset for tutorial |
| src/acore/multiple_testing/init.py | Code style fix: changed membership test to use not in operator |
| src/acore/enrichment_analysis/init.py | Code style fix: changed membership tests to use not in operator |
| src/acore/decomposition/umap.py | Added blank line for formatting consistency |
| src/acore/io/uniprot/uniprot.py | Removed extra blank lines |
| CONTRIBUTING.rst | Fixed installation command quoting and corrected PR template path |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| print(result) | ||
| print(ttest_rel(cond1, cond2)) |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug print statements should be removed from the test code. These print statements (lines 50-51) will clutter the test output.
| print(result) | |
| print(ttest_rel(cond1, cond2)) |
| # | ||
| # In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8). | ||
| # | ||
| # The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058). |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple spelling errors in this line: "wastewaster" should be "wastewater" and "inffluent" should be "influent".
| # The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058). | |
| # The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058). |
| # ## Data preparation details | ||
| # | ||
| # ### Downloading | ||
| # The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing. |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".
| # | ||
| # The permutation test compares the actual observed chosen metric (e.g., t-statistic, mean difference) with metrics calculated when the dataset values are randomly shuffled permutations of the dataset. | ||
| # | ||
| # If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance. |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "sizze" should be "size".
| # If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance. | |
| # If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance. |
| "## Data preparation details\n", | ||
| "\n", | ||
| "### Downloading\n", | ||
| "The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing.\n", |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".
| def random_choice_equal(): | ||
| return np.random.choice( | ||
| ["A", "B", "C", "D"], size=100, replace=True, p=[0.25, 0.25, 0.25, 0.25] | ||
| ) | ||
|
|
||
|
|
||
| @pytest.fixture | ||
| def random_choice_unequal(): | ||
| return np.random.choice( | ||
| ["A", "B", "C", "D"], size=100, replace=True, p=[0.1, 0.25, 0.05, 0.6] | ||
| ) |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test fixtures that use random number generation without a fixed seed can lead to non-reproducible test failures. Consider using np.random.default_rng(seed=...) or np.random.seed() to ensure reproducibility.
| # %% | ||
| from acore.permutation_test import paired_permutation | ||
|
|
||
| # optional choice of random number generatorfor repro |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space in comment: "generatorfor" should be "generator for".
| # optional choice of random number generatorfor repro | |
| # optional choice of random number generator for repro |
| "\n", | ||
| "In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).\n", | ||
| "\n", | ||
| "The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).\n", |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple spelling errors in this line: "wastewaster" should be "wastewater" and "inffluent" should be "influent".
| "The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).\n", | |
| "The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).\n", |
| "source": [ | ||
| "from acore.permutation_test import paired_permutation\n", | ||
| "\n", | ||
| "# optional choice of random number generatorfor repro\n", |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space in comment: "generatorfor" should be "generator for".
| "# optional choice of random number generatorfor repro\n", | |
| "# optional choice of random number generator for repro\n", |
| from acore import permutation_test as pt | ||
| import pytest | ||
| import numpy as np | ||
| from scipy.stats import ttest_rel, ttest_ind, mannwhitneyu, wilcoxon |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'ttest_ind' is not used.
Import of 'mannwhitneyu' is not used.
Import of 'wilcoxon' is not used.
| from scipy.stats import ttest_rel, ttest_ind, mannwhitneyu, wilcoxon | |
| from scipy.stats import ttest_rel |
|
@angelphanth Definitely 'squash and merge` this PR. |
Co-authored-by: Henry Webel <[email protected]>
Co-authored-by: Henry Webel <[email protected]>
Description
Add a subpackage (as new model) to core.
Tasks Checklist
__init__.pyin the new folder, so thatthey are available when the subpackage is imported.
Create Pandera schema in a file with subpackage name in thesrc/acore/typesfolder.Optimal is to have only one output schema of results per subpackage or module.
datafolder, or reuse an existing one for testingdocs/api_examplesfolder with that dataindex.rstfile in thedocsfolder with the new example/testfolder with the name of the subpackage or moduleusing pytest or unittests to test your new functionality.